ZMOB is a multi-microprocessor system consisting of
256 Z80A microprocessors that communicate via a fast
cyclic shift-register bus. This paper discusses the
efficient use of ZMOB for various types of image processing
operations, including point and local operations, discrete
transforms, geometric operations, and computation of image
statistics.
1. Introduction


1.1  ZMOB 

ZMOB [1] is a multi-microprocessor system with a
ring-like inter-processor communication system called the
"conveyor belt". The current configuration has 256 processors
and is capable of executing a total of about 100 million
instructions per second. This section explains features
of  the  conveyor  belt  architecture  exploited  in  processor 
communication. 

The  conveyor  belt  allows  any  processor  to  communicate 
with  any  other,  at  a  speed  so  great  that  it  is  unnoticeable 
to  the  processor.  Asynchronously,  processors  may  compute 
data  and  pass  intermediate  results  among  each  other.  The 
conveyor  belt  also  supports  tightly-synchronized  ("lock-step") 
parallel  image  processing  algorithms  by  allowing  processors 
to  all  communicate  data  in  an  organized  way  (e.g.,  a  "pass- 
right"  sequence)  or  rapidly  pass  blocks  of  data  among  one 
another  (termed  "burst  mode"). 

However, in instruction-level lock-step mode (where
absolute synchronous timing is crucial), not all patterns
of data exchange can occur during an infinitesimal communication
step. For example, no processor can receive data
from more than one processor or send data to more than one
specific processor at the same time.
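The lock-step constraint above can be sketched as a simple check on a set of transfers. This is only an illustration of one reading of the constraint: the function names and the (source, destination) pair representation are assumptions, not part of ZMOB's software.

```python
def is_valid_lockstep_exchange(transfers):
    """Check one communication step given as (source, destination) pairs.

    Assumed reading of the constraint: within a single step, each
    processor sends to at most one destination and receives from at
    most one source.
    """
    sources = [src for src, _ in transfers]
    destinations = [dst for _, dst in transfers]
    return (len(set(sources)) == len(sources)
            and len(set(destinations)) == len(destinations))


def pass_right(num_procs):
    """The 'pass-right' pattern: processor i sends to processor i+1 (cyclically)."""
    return [(i, (i + 1) % num_procs) for i in range(num_procs)]
```

Under this reading, a pass-right sequence over all 256 processors is a legal step, while two processors sending to the same destination is not.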
1.2 Image Processing on ZMOB

This  paper  deals  with  the  efficient  use  of  ZMOB  for 
performing  various  types  of  image  processing  operations, 
including point and local operations, discrete transforms,
geometric operations, and computation of image statistics.
The aim is to make the fullest possible use of ZMOB's
parallelism, so as to achieve a speedup by a factor proportional
to 256, the number of processors. To this end,
we consider how the image data should be partitioned
among the processors, and how the operations should be
segmented into computation and communication steps. We
also compare ZMOB processing with performing operations
on the host VAX itself.


2. Point and Local Operations


A  point  operation  on  an  image  computes  a  new  value 
for  each  pixel  as  a  function  of  the  old  value,  independent 
of the values of other pixels. To perform such an operation
on ZMOB, the image is divided into 256 parts in any
convenient  way;  each  ZMOB  processor  receives  one  part 
from  the  host  VAX  and  operates  on  its  pixels;  and  the 
results  are  returned  to  the  host  VAX. 

Let $C_z$ and $C_v$ be the times for a ZMOB processor and
for the VAX, respectively, to perform the given operation
on one pixel. Let $N^2$ be the number of pixels in the
image, and let $r$ be the time required to pass one pixel
from the VAX to ZMOB or vice versa via the UNIBUS.
Then the time required to perform the operation on the
entire image in the VAX is $C_v N^2$, while the time required to
perform it on ZMOB is $2rN^2 + C_z N^2/256$. Evidently, if
$512r + C_z < 256 C_v$, using ZMOB is advantageous.
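As a rough illustration of this break-even condition, the two timing expressions can be written out directly. The function names are hypothetical, and any per-pixel times passed in are placeholder values for illustration, not measurements.

```python
def vax_time(c_v, n_pixels):
    """Time to apply a point operation to every pixel on the VAX."""
    return c_v * n_pixels


def zmob_time(c_z, r, n_pixels, processors=256):
    """Time on ZMOB: load and unload over the UNIBUS, plus parallel compute."""
    return 2 * r * n_pixels + c_z * n_pixels / processors


def zmob_is_advantageous(c_z, c_v, r, processors=256):
    """Break-even test; with 256 processors this is 512 r + C_z < 256 C_v."""
    return 2 * processors * r + c_z < processors * c_v
```

The inequality is simply the comparison `zmob_time(...) < vax_time(...)` with the common factor (the pixel count) divided out and both sides multiplied by the number of processors.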

The  situation  is  more  complicated  when  we  deal  with 
local  operations,  in  which  the  result  for  a  given  pixel 
depends  on  the  values  of  the  pixel  and  a  set  of  its 
neighbors.  Here,  if  we  partition  the  image  into  disjoint 
parts,  exchange  of  information  between  ZMOB  processors  is 
necessary,  and  the  amount  of  exchange  depends  on  the  shapes 
of  the  parts.  Alternatively,  we  can  divide  the  image  into 
overlapping  parts,  such  that  for  every  pixel  there  exists 
a processor that contains the pixel and its neighbors. This
makes data exchange unnecessary when the local operation
is  performed  only  once;  but  if  the  operation  must  be 
iterated,  as  is  often  the  case,  the  amount  of  overlap 
needed  may  become  excessive. 

Section  2.1  discusses  the  optimal  choice  for  the 
shapes  of  the  parts,  and  concludes  that  square  blocks  are 
best,  at  least  for  all  the  standard  types  of  neighborhoods 
used  in  local  operations.  Section  2.2  discusses  the  amount 
of  overlap  and  shows  that  the  least  possible  overlap  is 
always  optimal.  Section  2.3  discusses  the  relative  merits 
of  performing  an  (iterated)  operation  on  ZMOB  or  on  the  host 
VAX  itself. 


2.1 Optimal Region Shape

Iterated  local  operations  performed  on  ZMOB  involve 
cycling  between  two  states:  computation,  where  each 
processor  performs  calculations  on  data  in  its  local 
memory,  and  communication,  where  some  or  all  processors 
pass information between themselves in synchrony. With
iterated local operations, this information will lie at
the border of the image subregion contained in each
processor. In the following sections, we address this
question: given that at each iteration processors
must pass appropriate border information, what is the optimal
image subregion shape?

2.1.1  Optimal  Rectangle:  Strips  vs.  Squares 

The first question is whether squares are the best
rectangles; intuitively, this is so, because (in the
8-neighbor case) a one-thick border around the region must
be passed at each iteration, and a square has the smallest
perimeter of any rectangle having the same area (Fig. 1).

More rigorously, let $A$ be the subregion area and $\ell$ be the
rectangle length, so that $A/\ell$ is the rectangle height.
Then $C$, the cost of passing the region perimeter, is
proportional to

$$C(\ell) = 2A/\ell + 2\ell + 4$$

To optimize for $\ell$, we differentiate and set to 0:

$$\frac{dC}{d\ell} = -\frac{2A}{\ell^2} + 2 = 0$$

Since $\ell^2 = A$ describes a square, it is the optimal rectangle.
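The strips-versus-squares argument can be checked numerically with a small sketch. It assumes the border cost $2A/\ell + 2\ell + 4$ (a one-pixel border including four corner pixels) and restricts $\ell$ to integer side lengths that divide the area; the function names are illustrative only.

```python
def border_cost(area, length):
    """One-thick border around a rectangle of the given area and length."""
    return 2 * area / length + 2 * length + 4


def best_length(area):
    """Integer side length (dividing the area) with the smallest border."""
    candidates = [l for l in range(1, area + 1) if area % l == 0]
    return min(candidates, key=lambda l: border_cost(area, l))
```

For a 512 by 512 image split among 256 processors, each region has area 1024: a 32 by 32 square has a border of 132 pixels, whereas a single-row strip of the same area has a border of 2054.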

2.1.2  The  8-Neighbor  Case:  Comparison  of  Square,  Diamond, 
Triangular,  and  Circular  Shaped  Regions 

Now that squares have been shown to be the best rectangles
for local operations, diamond, triangular, and
circular regions will be compared for efficiency too (without
regard for the potential difficulty of performing the
subdivision). The statistic compared will be the perimeter-to-area
ratio, which measures the overhead spent passing data
relative to the real work of computing. Figure 2 graphically
illustrates the border size calculation, and Table 1 contains
the perimeter/area ratios.

As presented, the data show that squares are better than
triangles, which in turn are better than diamonds. Circles, in the limit,
are as good as squares, but for realistic values come out
worse (in Fig. 2, the circle has area 49 and perimeter 40,
while the equivalent square's perimeter is 32), not to mention
the image subdivision problem. Again, squares are best.

2.1.3  Other  Neighborhoods 
a.  The  4-Neighbor  Case 

Figure  3  illustrates  the  algebraic  relationship  between 
perimeter  size  and  area  for  square,  diamond  and  triangular 
regions in the 4-neighbor case. Since the neighborhood is
symmetrical, the triangle produces the same result in any
orientation. Table 2 shows that a square is superior to
either a triangular or diamond region.


b. The 2x2 Neighborhood

This  neighborhood  shape  is  used  in  the  Roberts 
gradient  and  in  shrinking  and  shifting  operations.  Figure 
3  and  Table  2  again  show  that  a  square  region  subdivision 
is best for these operations.

c. Other Asymmetric Neighborhoods

The  two  types  of  neighborhoods  used  in  the  standard 
connected component labeling operation, for 8- and 4-connectedness,
are shown in Figure 3. Once more, Table 2 shows that
a  square  is  best. 
2.2 Optimal Region Overlap

Once  the  region  shape  has  been  decided,  the  next 
question  is  how  to  best  coordinate  the  cycling  between 
computation  and  communication  in  the  course  of  iterated 
local operations. In particular, additional border information
is required for each iteration of a local operation.
Is it best to pass several layers of border information at
once and then compute on them, or just one layer at a time?

The  answer  is:  just  one  layer  at  a  time.  Intuitively, 
the  more  layers  we  pass  at  each  stage,  the  larger  each 
successive  layer  gets  (Fig.  4).  Likewise,  the  amount  of 
computation  grows  for  each  successive  layer  of  border 
points  added  to  the  region  (Fig.  5).  Thus,  we  cannot 
gain by passing more than one layer at a time between
computations, and the pass-one-layer, compute-one-iteration
strategy is optimal.

More formally, let $i$ = the total number of iterations
to be performed, $j$ = the number of iterations to be performed
at each step (the variable to be optimized), $p$ =
the time to pass one point, $t$ = the time to perform one
local operation, and $n$ = the side length of the square we
are dealing with. The communication time for one round
is

$$4(nj + j^2)p$$

The computation time for one round is

$$t \sum_{k=0}^{j-1} (n + 2k)^2,$$

or (as a polynomial in $j$)

$$t\left(\tfrac{4}{3}j^3 + (2n-2)j^2 + (n^2 - 2n + \tfrac{2}{3})j\right)$$

Adding these together, and multiplying by $i/j$, which is
the total number of rounds to complete $i$ iterations, gives
us a total time of

$$\tfrac{4}{3}tij^2 + 2i(t(n-1) + 2p)j + (ti(n^2 - 2n + \tfrac{2}{3}) + 4inp)$$

Differentiating, setting to zero, and solving for $j$ gives

$$j = -\left(\tfrac{3}{4}(n-1) + \tfrac{3}{2}\,\tfrac{p}{t}\right)$$

A negative optimal value of $j$ implies that we should use
the minimum legal value, and $j$ should be 1.
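The overlap argument can be checked numerically. The sketch below evaluates the total time both by direct summation and through the polynomial form, using the symbols defined above; the function names and any parameter values are illustrative assumptions.

```python
def round_time(j, n, p, t):
    """Time for one round: pass j border layers, then compute j iterations."""
    communication = 4 * (n * j + j * j) * p
    computation = t * sum((n + 2 * k) ** 2 for k in range(j))
    return communication + computation


def total_time(i, j, n, p, t):
    """Total time for i iterations done j at a time (i/j rounds)."""
    return (i / j) * round_time(j, n, p, t)


def total_time_polynomial(i, j, n, p, t):
    """The same total, written as a quadratic in j."""
    return ((4 / 3) * t * i * j ** 2
            + 2 * i * (t * (n - 1) + 2 * p) * j
            + (t * i * (n * n - 2 * n + 2 / 3) + 4 * i * n * p))


def stationary_point(n, p, t):
    """Value of j where the derivative of the quadratic vanishes."""
    return -(0.75 * (n - 1) + 1.5 * p / t)
```

Because every coefficient of the quadratic is positive, the total time increases with $j$ for all $j \ge 1$, which is why the stationary point is negative and $j = 1$ is the best legal choice.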


2.3  Timing:  VAX  vs.  ZMOB  Computation  Tradeoff 


When  is  it  better  to  use  ZMOB  rather  than  simply  using 
the  host  VAX?  In  other  words,  when  does  the  overhead  of 
using  ZMOB  (loading  and  unloading  an  image  to/from  the 
processors  via  the  conveyor  belt)  offset  the  time  saved  in 
performing  the  (iterated)  operation?  To  answer  this, 
we  must  first  obtain  formulas  for  computation  times  on 
VAX  and  ZMOB.  The  variables  will  be: 
$N$ = length of image side (area = $N^2$)
$P$ = number of processors in ZMOB (256)
$p$ = time to pass one pixel between ZMOB processors
$C_z$ = time to compute one local operation on ZMOB
$C_v$ = time to compute one local operation on VAX
$n$ = ZMOB square region side ($n^2 = N^2/P$)
$m$ = number of iterations of local operation
$r$ = time to pass one pixel over the UNIBUS

2.3.1 VAX and ZMOB Computation Times


On the VAX, the time to compute $m$ iterations of
a local operation which takes $C_v$ time per pixel is

$$T_{VAX} = m C_v N^2$$

On ZMOB, the computation must be split into three
stages: loading ($L_z$), processing ($P_z$), and unloading ($U_z$).

a.  Loading  and  Unloading 


Each processor in ZMOB may be loaded simultaneously
from the VAX over the conveyor belt; each processor's
subregion of $N^2/256$ points is loaded at the transfer rate
of the conveyor belt, $p$. However, the loading time is limited
by the time it takes to pass the entire $N^2$ image points between
the VAX and ZMOB over the UNIBUS; this occurs at the UNIBUS
transfer rate ($r$). Loading and unloading times are the same:

$$L_z = U_z = rN^2$$

Tables 3 and 4 show typical results for the realistic values

N = 512
P = 256
p = $10^{-5}$ sec. (10 μsec/byte ZMOB transfer rate; a conservative estimate)
r = $4 \times 10^{-7}$ sec. (400 nsec/byte UNIBUS transfer rate)

Table 3 gives minimum ZMOB computation times for $T_{VAX} = T_{ZMOB}$,
and Table 4 gives minimum times for $T_{VAX} = 10\,T_{ZMOB}$.


We can see from these tables that, since the smallest
value for $C_z$ is the time required for one Z80 instruction, or
about one microsecond ($10^{-6}$ sec.), ZMOB will almost always
be advantageous and will often be more than ten times faster
than the VAX. We can also see this in the following list of
fractional overhead values (the ratio of ZMOB loading and
unloading time to the total processing time) for a
(one-iteration) local operation: 98.9% when $C_z$ equals $10^{-6}$ sec.
(around one instruction), 94.8% at $C_z = 10^{-5}$ sec., 66.9% at
$C_z = 10^{-4}$ sec., 17.0% at $C_z = 10^{-3}$ sec., and 2.0% at $C_z = 10^{-2}$ sec.
The ratio drops well below 1% for more than one iteration or
larger values of $C_z$. Thus, even for once-performed local operations,
ZMOB loading and unloading overhead is relatively
small, and since $C_z/256 \ll C_v$ (usually), the use of ZMOB will
ordinarily be advantageous.
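The fractional overhead values quoted above can be reproduced with a short sketch. Note the assumption: the formula for the ZMOB processing stage does not survive in this copy of the text, so the code assumes a per-iteration cost of $C_z n^2 + 4(n+1)p$, that is, computing on an $n \times n$ region plus passing a one-layer border (the $j = 1$ case of the communication cost in Section 2.2). The function name is hypothetical.

```python
import math


def overhead_fraction(c_z, m=1, N=512, P=256, p=1e-5, r=4e-7):
    """Loading plus unloading time as a fraction of total ZMOB time.

    Assumed processing model (not taken verbatim from the text):
    each iteration costs c_z * n^2 for computation plus 4 * (n + 1) * p
    for passing a one-pixel border, with n the region side length.
    """
    n = math.isqrt(N * N // P)            # region side; 32 for N=512, P=256
    load_unload = 2 * r * N * N           # L_z + U_z over the UNIBUS
    processing = m * (c_z * n * n + 4 * (n + 1) * p)
    return load_unload / (load_unload + processing)
```

Under that assumption, evaluating `overhead_fraction` at $C_z = 10^{-6}, 10^{-5}, 10^{-4}, 10^{-3}, 10^{-2}$ sec. gives fractions that round to the five percentages listed in the text.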


3. Two-Dimensional Discrete Transforms

The method described below calculates the two-dimensional
Fourier transform (or other similar discrete
transforms) of an N by N image in O(N log N) time. Each
processor is assigned a subregion of consecutive rows of
the image. The process is composed of three steps: a
row-wise fast Fourier transform (FFT) by each processor;
transposition of the image (matrix) between processors;
and a (now) column-wise FFT. Executing the FFT on each
row held by the processor is straightforward and is performed
in O(N log N) time. Transposition of the image to perform
the column-wise transform is accomplished as follows:
each processor is destined to receive a portion of each
row during the course of the transposition, with one portion
remaining in the processor. Processor i passes the portion
to go to processor i+1, which can be determined by computation,
during the first communication round; this quantity
may be several elements (and several bytes per element).
During the second round, processor i+2 receives its
portion from processor i, and so on, until 255 rounds have
been completed. Each processor now contains one or more
columns. The process is illustrated in Figure 6. Each
row and column takes O(N log N) time to be transformed;
each processor contains N/256 rows or columns, but since
N is bounded by the image size that ZMOB can realistically
hold, this can be regarded as a constant. The transposition
process takes O(N) time to transfer the elements of
one or more rows to other processors. Thus the entire
algorithm takes time proportional to 2N log N + N,
or O(N log N).
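The round-by-round transposition can be simulated on a single machine. The following is a minimal sketch, assuming a square image whose side is a multiple of the number of processors; the function name and the list-of-lists representation are illustrative choices, not ZMOB code.

```python
def transpose_by_rounds(image, num_procs):
    """Simulate the cyclic transposition: in round k, processor i sends
    the portion of its rows destined for processor (i + k) mod P."""
    size = len(image)
    rows_per_proc = size // num_procs
    # local[i] is the strip of consecutive rows held by processor i.
    local = [image[i * rows_per_proc:(i + 1) * rows_per_proc]
             for i in range(num_procs)]
    # result[d] will become processor d's strip of the transposed image.
    result = [[[None] * size for _ in range(rows_per_proc)]
              for _ in range(num_procs)]
    for k in range(num_procs):            # k = 0 is the portion that stays put
        for src in range(num_procs):
            dst = (src + k) % num_procs   # one receiver per sender each round
            for lr in range(rows_per_proc):
                for lc in range(rows_per_proc):
                    global_row = src * rows_per_proc + lr
                    global_col = dst * rows_per_proc + lc
                    # Element (row, col) lands at (col, row) after transposition.
                    result[dst][lc][global_row] = local[src][lr][global_col]
    return [row for strip in result for row in strip]
```

In every round each processor sends to exactly one other processor and receives from exactly one, which is what makes the pattern compatible with the lock-step restriction described in Section 1.1.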


4. Geometric Correction

The problem of performing geometric correction of an
image in parallel using ZMOB involves each processor
receiving information about the input image from other
processors for each point in the output subregion assigned
to that processor. The value of each output point is
computed by interpolation from the values of a set of input
points surrounding an ideal input point, usually having
non-integer coordinates, whose position is defined by the
inverse of the given coordinate transformation. For
example, for bilinear interpolation we use a 2 by 2 neighborhood
of the ideal input point, while for cubic spline
interpolation we use a 4 by 4 neighborhood.
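To make the interpolation step concrete, here is a minimal sketch of bilinear interpolation over a 2 by 2 neighborhood, applied through an inverse coordinate mapping. The function names and the `inverse_map` callback are hypothetical, and the sketch assumes every ideal input point falls strictly inside the input image so that all four neighbors exist.

```python
import math


def bilinear_sample(image, x, y):
    """Interpolate the image at a non-integer point (x = column, y = row)
    from the 2 by 2 neighborhood surrounding it."""
    x0, y0 = math.floor(x), math.floor(y)
    fx, fy = x - x0, y - y0
    top = (1 - fx) * image[y0][x0] + fx * image[y0][x0 + 1]
    bottom = (1 - fx) * image[y0 + 1][x0] + fx * image[y0 + 1][x0 + 1]
    return (1 - fy) * top + fy * bottom


def geometric_correction(image, inverse_map, out_rows, out_cols):
    """Build the output image by sampling the input at inverse-mapped points.

    inverse_map(row, col) is assumed to return the ideal input point (x, y).
    """
    return [[bilinear_sample(image, *inverse_map(row, col))
             for col in range(out_cols)]
            for row in range(out_rows)]
```

For example, a half-pixel shift corresponds to `inverse_map = lambda row, col: (col + 0.5, row + 0.5)`, in which case each output value is the average of four input pixels.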

One desirable condition for efficient geometric
correction on ZMOB is to have each interpolation neighborhood
reside entirely within one processor, so that no more
than one need be consulted to obtain an output image value.
This may be ensured by providing suitable overlaps between
the subregions handled by the processors (e.g., a one-row
border for a 2 by 2 neighborhood). However, there is no
way of guaranteeing, in general, that we can compute the
output values in such a way that, at each step, each
processor needs information from a different processor.
As a result, the communication between processors will not
be evenly distributed, and it becomes impossible to give an
exact estimate of the time required. Only in the special
case where the pixel displacement is bounded by some
distance d does it become possible to provide an overlap
between processors proportional to d, thus allowing each
processor to compute its portion of the output image
without consulting other processors.


5. Computation of Image Statistics

5.1 Image Histogram Algorithm

We first consider the problem of creating the grey-level
histogram of an image in ZMOB (either freshly
loaded from the VAX or already present after a series of
previous image operations). In the algorithm to be
described below, the goal is for each of the 256 processors
to contain the frequency of occurrence of one of the values
of the (eight-bit) grey level, for an image of arbitrary
size (though with an upper bound, within the constraints
of local memory).

The  method  is  divided  into  two  steps:  local  histogram 
creation  and  histogram  merging.  During  histogram  creation, 
each  processor  creates  a  256-bucket  histogram  for  its 
sub-portion  of  the  total  image,  the  image  area  being 
divided into 256 equal parts (the strategy for partitioning
is irrelevant and no overlapping is necessary). Each bucket
may be of some appropriate size, say 16 or 24 bits, which
will accommodate the largest possible value, or perhaps the
highest bit may be reserved as a bucket overflow indicator.
Each processor also has a different (and larger) bucket,
corresponding to its processor I.D. and to the grey level
that it will be counting, that it is responsible for
totalling during the next step.


During the histogram merging phase, each processor
will pass the contents of each histogram bucket (other
than its own) to the appropriate processor for totalling
during 255 communications rounds. Each processor already
has the initial count for its own bucket. During the
first round, processor i passes the contents of bucket
i+1 (modulo 256) to processor i+1 for totalling. On the
second round, bucket i+2 is passed to processor i+2, and
so on. After all 255 bucket values belonging to other
processors are passed, they are disregarded and the
processor's own final value is returned to the VAX.
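The two-step method (local histogram creation, then cyclic merging) can be simulated as follows. This is a sketch under the simplifying assumption that the number of processors equals the number of grey levels, as in the 256-processor, eight-bit case; the function names are illustrative.

```python
def local_histogram(pixels, levels):
    """Histogram of one processor's share of the image."""
    hist = [0] * levels
    for value in pixels:
        hist[value] += 1
    return hist


def merge_histograms(local_hists):
    """Simulate the merging rounds: in round k, processor i passes its
    bucket (i + k) mod P to processor (i + k) mod P for totalling."""
    num_procs = len(local_hists)
    # Each processor starts with the count in its own bucket.
    totals = [local_hists[i][i] for i in range(num_procs)]
    for k in range(1, num_procs):         # P - 1 communication rounds
        for src in range(num_procs):
            dst = (src + k) % num_procs
            totals[dst] += local_hists[src][dst]
    return totals
```

After the final round, `totals[g]` holds the image-wide frequency of grey level `g`, which is the value processor `g` would return to the VAX.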


5.2 Co-occurrence Matrix Computation


The problem of computing co-occurrence matrices is
very similar to that of histogramming. Each co-occurrence
matrix element is a frequency of a pair of grey levels
occurring at a particular distance and orientation from
one another, just as each element of the histogram (vector)
is the frequency of a single grey level occurring at any
pixel. One difference is that a co-occurrence matrix
is potentially much larger (the square of the total number
of grey levels). Usually, the range of grey levels used
is more restricted than in the histogram case; e.g., we
use only the upper five or six bits of the grey value.
Another difference is that the geometry of the pixel pair
calls for the use of appropriate overlap when storing the
image subregions in the processors (see Figure 7). In
particular, if each pixel is compared with one m units
horizontally and n units vertically displaced, we can
use square subregions with m columns and n rows of overlap.
This obviates the need to request information
from other processors during the course of the computation,
at a great saving of time and a small cost in
extra memory. The process then proceeds similarly
to the histogramming algorithm: each processor computes
a co-occurrence matrix for its subregion; each processor
is assigned 1/256th of the matrix elements (arbitrarily)
to total; and through 255 rounds of communication, each
processor sends each other processor its portion of the
matrix to total, and receives the other 255 values for
its own matrix portion. Each matrix portion will probably
consist of several matrix elements, each potentially
several bytes long. After the totalling is completed,
each processor communicates its portion to the VAX where
the final matrix is assembled.
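The role of the overlap can be illustrated with a small sketch of the per-processor counting step. It assumes non-negative displacements `dx` (columns) and `dy` (rows); the function names and the use of Python ranges to describe a block are illustrative choices only.

```python
def block_cooccurrence(image, levels, dx, dy, rows, cols):
    """Count grey-level pairs whose first pixel lies in the given block.

    The second pixel is dx columns and dy rows away, so the block reads
    at most dx columns and dy rows beyond its own area: the overlap.
    """
    height, width = len(image), len(image[0])
    matrix = [[0] * levels for _ in range(levels)]
    for r in rows:
        for c in cols:
            r2, c2 = r + dy, c + dx
            if r2 < height and c2 < width:    # partner must be in the image
                matrix[image[r][c]][image[r2][c2]] += 1
    return matrix


def add_matrices(a, b):
    """Element-wise sum, standing in for the totalling rounds."""
    return [[x + y for x, y in zip(row_a, row_b)]
            for row_a, row_b in zip(a, b)]
```

Summing the block matrices over a partition of the image gives the same result as counting over the whole image at once, which is why no pixel values need to be requested from other processors during the counting step.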




6. Concluding Remarks

We have seen that ZMOB should have substantial
speed advantages in many image processing situations.
In particular, we have outlined efficient ZMOB communication/computation
schemes for point and local operations
(with particular reference to how the data should be
partitioned among the processors), discrete transforms,
geometric operations (in some cases), and computation of
statistics. These schemes demonstrate that efficient
use of ZMOB's parallelism is possible for essentially
all basic image processing and analysis tasks.